New Algorithms for Heavy Hitters in Data Streams (Invited Talk)

Author

  • David P. Woodruff
Abstract

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the l1-heavy hitters and l2-heavy hitters. There are a number of algorithmic solutions for these problems, starting with the work of Misra and Gries, as well as the CountMin and CountSketch data structures, among others. In this survey paper, accompanying an ICDT invited talk, we cover several recent results developed in this area, which improve upon the classical solutions to these problems. In particular, with coauthors we develop new algorithms for finding l1-heavy hitters and l2-heavy hitters, with significantly less memory required than what was known, and which are optimal in a number of parameter regimes.

1 The Heavy Hitters Problem

A well-studied problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. These can be used for flow identification at IP routers [21], in association rules and frequent itemsets [1, 25, 26, 44, 47], and for iceberg queries and iceberg datacubes [7, 22, 24]. We refer the reader to the survey [18], which presents an overview of known algorithms for this problem, from both theoretical and practical standpoints. There are various flavors of guarantees for the heavy hitters problem. We start with what is known as the l1-guarantee:

Definition 1 (l1-(ε, φ)-Heavy Hitters Problem) In the (ε, φ)-Heavy Hitters Problem, we are given parameters 0 < ε < φ < 1, as well as a stream a_1, ..., a_m of items a_j ∈ {1, 2, ..., n}. Let f_i denote the number of occurrences of item i, i.e., its frequency. The algorithm should make one pass over the stream and at the end of the stream output a set S ⊆ {1, 2, ..., n} for which if f_i ≥ φm, then i ∈ S, while if f_i ≤ (φ − ε)m, then i ∉ S. Further, for each item i ∈ S, the algorithm should output an estimate f̃_i of the frequency f_i which satisfies |f_i − f̃_i| ≤ εm.

We are interested in algorithms which use as little space (i.e., memory) in bits as possible to solve the l1-(ε, φ)-Heavy Hitters Problem. We allow the algorithm to be randomized and to succeed with probability at least 1 − δ, for 0 < δ < 1. We do not make any assumption on the ...

∗ A preliminary version of this paper appeared as an invited paper in ICDT 2016.
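To make the guarantee in Definition 1 concrete, here is a minimal sketch of the classical Misra and Gries counter algorithm mentioned in the abstract. It is not one of the newer, more space-efficient algorithms the survey covers. The function name, the choice of k = ⌈1/ε⌉ counters, and the strict reporting threshold are assumptions made for this illustration. With k counters, each stored count undercounts the true frequency by at most m/(k + 1) < εm and never overcounts, so reporting every item whose count exceeds (φ − ε)m should satisfy both conditions of the definition deterministically.

```python
import math


def misra_gries_heavy_hitters(stream, eps, phi):
    """One-pass Misra-Gries sketch for the l1-(eps, phi)-heavy hitters problem.

    Illustrative only: the name and the parameter handling are assumptions,
    not taken from the paper. Returns a dict mapping each reported item to
    its frequency estimate.
    """
    if not 0 < eps < phi < 1:
        raise ValueError("need 0 < eps < phi < 1")

    k = math.ceil(1 / eps)  # number of counters kept at any time
    counters = {}           # item -> current count (always an undercount)
    m = 0                   # stream length seen so far

    for item in stream:
        m += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # All k counters are in use: decrement every counter and drop
            # those that reach zero. The arriving item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]

    # Counts satisfy f_i - m/(k+1) <= count <= f_i, so a strict threshold of
    # (phi - eps) * m keeps every item with f_i >= phi * m and rejects every
    # item with f_i <= (phi - eps) * m.
    threshold = (phi - eps) * m
    return {i: c for i, c in counters.items() if c > threshold}


if __name__ == "__main__":
    # Hypothetical stream of m = 100 items: item 1 occurs 60 times, item 2
    # occurs 25 times, and items 3..17 occur once each.
    example = [1] * 60 + [2] * 25 + list(range(3, 18))
    print(misra_gries_heavy_hitters(example, eps=0.1, phi=0.5))
```

This sketch stores about 1/ε items and counters, and it illustrates the kind of classical baseline whose memory usage the results described in the survey aim to improve.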

Similar resources

New Algorithms for Heavy Hitters in Data Streams

An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the l1-heavy hitters and l2-heavy hitters. There are a number of algorithmic ...

Hashing Pursuit for Online Identification of Heavy-Hitters in High-Speed Network Streams

Distributed Denial of Service (DDoS) attacks have become more prominent recently, both in frequency of occurrence, as well as magnitude. Such attacks render key Internet resources unavailable and disrupt its normal operation. It is therefore of paramount importance to quickly identify malicious Internet activity. The DDoS threat model includes characteristics such as: (i) heavy-hitters that tra...

Implementing Hierarchical Heavy Hitters in RapidMiner: Solutions and Open Questions

Huge masses of data and potentially infinite data streams pose big challenges to methods in data mining that analyse data off-line and in several passes. In the area of intrusion detection, algorithms that detect characteristical patterns in system call data could have to process several hundred megabytes of data per minute. We describe a plugin for the aggregation of data streams by determinin...

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

We give the first optimal bounds for returning the l1-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of m items in {1, 2, ..., n} and parameters 0 < ε < φ ≤ 1, let f_i denote the frequency of item i, i.e., the number of times item i occurs in the stream. With arbitrarily large constant probab...

On Low-Risk Heavy Hitters and Sparse Recovery Schemes

We study the heavy hitters and related sparse recovery problems in the low-failure probability regime. This regime is not well-understood, and has only been studied for non-adaptive schemes. The main previous work is on sparse recovery by Gilbert et al. (ICALP’13). We recognize an error in their analysis, improve their results, and contribute new non-adaptive and adaptive sparse recovery algori...

Publication date: 2016